If all you have is a bit of the Bible: Learning POS taggers for truly low-resource languages
نویسندگان
چکیده
We present a simple method for learning part-of-speech taggers for languages like Akawaio, Aukan, or Cakchiquel – languages for which nothing but a translation of parts of the Bible exists. By aggregating over the tags from a few annotated languages and spreading them via wordalignment on the verses, we learn POS taggers for 100 languages, using the languages to bootstrap each other. We evaluate our cross-lingual models on the 25 languages where test sets exist, as well as on another 10 for which we have tag dictionaries. Our approach performs much better (20-30%) than state-of-the-art unsupervised POS taggers induced from Bible translations, and is often competitive with weakly supervised approaches that assume high-quality parallel corpora, representative monolingual corpora with perfect tokenization, and/or tag dictionaries. We make models for all 100 languages available.
منابع مشابه
If You Even Don't Have a Bit of Bible: Learning Delexicalized POS Taggers
Part-of-speech (POS) induction is one of the most popular tasks in research on unsupervised NLP. Various unsupervised and semisupervised methods have been proposed to tag an unseen language. However, many of them require some partial understanding of the target language because they rely on dictionaries or parallel corpora such as the Bible. In this paper, we propose a different method named de...
متن کاملEmpirical Gaussian priors for cross-lingual transfer learning
Sequence model learning algorithms typically maximize log-likelihood minus the norm of the model (or minimize Hamming loss + norm). In cross-lingual part-ofspeech (POS) tagging, our target language training data consists of sequences of sentences with word-by-word labels projected from translations in k languages for which we have labeled data, via word alignments. Our training data is therefor...
متن کاملInducing Multilingual Text Analysis Tools Using Bidirectional Recurrent Neural Networks
This work focuses on the rapid development of linguistic annotation tools for resource-poor languages. We experiment several cross-lingual annotation projection methods using Recurrent Neural Networks (RNN) models. The distinctive feature of our approach is that our multilingual word representation requires only a parallel corpus between the source and target language. More precisely, our metho...
متن کاملسیستم برچسب گذاری اجزای واژگانی کلام در زبان فارسی
Abstract: Part-Of-Speech (POS) tagging is essential work for many models and methods in other areas in natural language processing such as machine translation, spell checker, text-to-speech, automatic speech recognition, etc. So far, high accurate POS taggers have been created in many languages. In this paper, we focus on POS tagging in the Persian language. Because of problems in Persian POS t...
متن کاملYou Say Periklute, I Say Paraclete: Towards a Reconciliation Between the Bible and the Quran
The Quranic statement that Jesus predicted Muhammad by name is examined in light of the expectation of what the “kingdom of God” was. The concept of the kingdom of God as being the light or fire of the Holy Spirit at Pentecost is contrasted with the Sufi concept of the “Light of Muhammad.” Pentecost could be the Light of Muhammad coming upon the apostles of Christ; the Light is the same, but kn...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2015